AITopics | training token

Collaborating Authors

training token

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Hybrid Norm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Neural Information Processing SystemsJun-15-2026, 16:17:57 GMT

Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks, especially regarding the position of the layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence to demonstrate that HybridNorm improves the gradient flow and the model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Reasoning is Periodicity? Improving Large Language Models Through Effective Periodicity Modeling

Neural Information Processing SystemsJun-14-2026, 06:40:43 GMT

Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in Transformer affect the learning efficiency and establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which adapts Fourier Analysis Network (FAN) into attention mechanism to achieve efficient periodicity modeling, by modifying the feature projection process of attention mechanism. Extensive experimental results on language modeling show that FANformer consistently outperforms Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. Our pretrained FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. Moreover, we reveal that FANformer exhibits superior ability to learn and apply rules for reasoning compared to Transformer.

artificial intelligence, large language model, natural language, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Zebra-Llama: Towards Extremely Efficient Hybrid Models

Neural Information Processing SystemsJun-12-2026, 18:00:09 GMT

With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, X-EcoMLA, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. X-EcoMLA achieves Transformer-level accuracy with near-SSM efficiency using only 7-11 billion training tokens (compared to the trillions required for pre-training) and an 8B teacher. Moreover, it dramatically reduces KV cache size--down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively--while preserving 100%, 100%, and over 97% of average zero-shot performance on LM Harness tasks. Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, our approach consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory.

artificial intelligence, large language model, natural language, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Not All Tokens Are What You Need for Pretraining

Neural Information Processing SystemsApr-26-2026, 17:18:33 GMT

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens.

artificial intelligence, machine learning, natural language, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.53)
Information Technology > Artificial Intelligence > Machine Learning (0.42)

Add feedback

8066ae1446b2bbccb5159587cc3b3bcc-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 04:06:07 GMT

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > China (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)

Add feedback

Appendix

Neural Information Processing SystemsFeb-11-2026, 18:08:07 GMT

H.3 WinogenderSetup We follow the same setup as in Rae et al.[38].

gopher, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf

Neural Information Processing SystemsFeb-11-2026, 18:08:04 GMT

gopher, language model, model size, (14 more...)

Neural Information Processing Systems

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

An empirical analysis of compute-optimal large language model training

Neural Information Processing SystemsDec-25-2025, 04:41:34 GMT

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more data.

empirical analysis, language model training, name change, (5 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.07)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.42)

Add feedback

ADAPT: Learning Task Mixtures for Budget-Constrained Instruction Tuning

Kadasi, Pritam, Upperwal, Abhishek, SIngh, Mayank

arXiv.org Artificial IntelligenceDec-5-2025

We propose ADAPT, a meta-learning algorithm that \emph{learns} task sampling proportions under an explicit token budget for multi-task instruction tuning. Instead of fixing task weights by hand, \adapt{} maintains a continuous distribution over tasks and updates it via meta-gradients of a smooth worst-case validation objective, inducing an adaptive curriculum that allocates more tokens to useful tasks while avoiding collapse. We instantiate ADAPT on three $\sim$1B-parameter open-weight LLMs (Gemma-3-1B, LLaMA-3.2-1B, Qwen-0.6B), training on 20 Natural Instructions task types under budgets of $1\%$, $5\%$, and $10\%$ of the available supervised tokens, and compare against strong supervised fine-tuning baselines with uniform and size-proportional mixing. We conduct evaluations on 11 out-of-domain benchmarks spanning reasoning, reading comprehension, code generation, and instruction following, we find that ADAPT matches or slightly improves average downstream performance relative to the best static mixture, while using fewer effective training tokens and reallocating budget toward harder, benchmark-aligned tasks.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2512.04555

Country:

Asia (0.93)
Europe (0.67)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Thinking Augmented Pre-training

Wang, Liang, Yang, Nan, Huang, Shaohan, Dong, Li, Wei, Furu

arXiv.org Artificial IntelligenceOct-20-2025

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.20186

Country:

North America > United States (0.28)
Asia (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback